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Abstract 



Unions of subspaces are powerful nonlinear signal models for collections of high-dimensional 
data. However, existing nic;thods that exploit this structure require that the subspaces the 
signals of interest occupy be known a priori or be learned from the data directly. In this 
work, we analyze the performance of greedy feature selection strategies for learning unions 
of subspaces from an ensemble of data. We develop sufficient conditions that arc required 
for orthogonal matching pursuit (OMP) to recover subsets of points from the ensemble 
that live in the same subspace, a property we refer to as exact feature selection (EFS). 
Following this analysis, we provide an empirical study of greedy feature selection strategies 
for subspace clustering and characterize the gap between sparse recovery methods and 
nearest neighbor (NN)-based approaches. We demonstrate that the gap between sparse 
recovery and NN methods is particularly pronounced when the tiling of subspaces in the 
ensemble is sparse, suggesting that sparse recovery methods can be used in a number of 
regimes where nearest neighbor approaches fail to reveal the subspace membership of points 
in the ensemble. 

Keywords: Subspace clustering, unions of subspaces, hybrid linear models, sparse ap- 
proximation, structured sparsity, nearest neighbors, low-rank approximation. 

1. Introduction 

1.1 Unions of Subspaces 

With the emergence of novel sensing systems capable of acquiring data at scales ranging 
from the nano to the peta, modern sensor and imaging data are becoming increasingly high- 
dimensional and heterogeneous. To cope with this explosion of high-dimensional data, one 
must exploit the fact that low-dimensional geometric structure exists amongst collections 
of data. 

Linear subspace models are one of the most widely used signal models for collections of 

high-dimcnsional data, with applications throughout signal processing, machine learning, 
and the computational sciences. This is due in part to the simplicity of linear models but 
also due to the fact that principal components analysis (PCA) provides a closed-form and 
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computationally efficient solution to the problem of finding an optimal low-rank approxi- 
mation to a collection of data (an ensemble of points in M"). More formally, if we stack a 
collection of d vectors (points) in M" into the columns of y G M"^'^, then PCA ffiids the 
best rank-fc estimate of Y by solving 



(PCA) 



mm 



\Y — X\\f subject to rank(X) < k. 



(1) 



In many cases, a linear subspace model is sufficient to characterize the intrinsic structure 
of the ensemble; however, in many emerging applications, a single subspace is not enough. 
Instead, ensembles must be modeled as living on a union of subspaces or a union of affine 
planes of mixed or equal dimension. Ensembles ranging from collections of images taken 



of objects under different illumination conditions (Basri and Jacobs, 2003 Ramamoorthi 



2002), motion trajectories of point-correspondences (Kanatani, 2001), to structured sparse 



and block-sparse signals (Lu and Do 2008; Blumensath and Davies 2009 Baraniuk et al. 



2010) are all well-approximated by a union of low-dimensional subspaces or a union of affine 



hyperplanes. Union of subspace models have also found utility in the classification of signals 
collected from complex and adaptive systems at different instances in time, e.g., electrical 



signals collected from the brain's motor cortex ( Gowreesunker et al. , 2011) 



Unions of subspaces provide a natural extension to single subspace models, but providing 
an extension of PCA that leads to provable guarantees for learning multiple subspaces is 
challenging. This is due to the fact that segmentation — the identification of points that 
live in the same subspace — and subspace estimation must be performed simultaneously. 
However, if we can accurately sift through the points in the ensemble and determine which 
points lie along or near the same subspace, then subspace estimation becomes trivial. For 
this reason, many state-of-the art methods for learning unions of subspaces rely on first 
forming local subspace estimate^bom a subset of points in the data. 

A common heuristic used to obtain local subspace estimates is to select points that lie 
within an Euclidean neighborhood of one another (or a fixed number of nearest neighbors 
(NN)) and then form a local estimates from multiple sets of NNs. At a high-level, most 
NN-based approaches for subspace clustering can be summarized as consisting roughly of 
the following three steps: 

(1) For the i^^ point in the set, m, select a set of points from the ensemble that live within 
an e-radius from yi in terms of their Euclidean distance. Denote this subset of points 
Y\, where A is an index set containing the indices of all the neighbors of m. 

(2) Form a low-rank PCA estimate by solving ([T]) for the points in the sub- matrix Ya- 

(3) Compute the subspace affinity matrix W G W^^'^ for the ensemble, where the {i,j) 
entry of the matrix represents the likelihood that yi and yj live close to the same 
subspace or whether yj and yj produce similar local subspace estimates. 

Methods that use NN sets to form local subspace estimates from the data include local 



subspace affinity (LSA) (Yan and Pollefeys, 2006), spectral clustering based on locally 



1. A local subspace estimate is a low-rank approximation formed from a subset of points in the ensemble, 
rather than from the entire collection of data. 
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linear approximations (Arias-Castro et al. , 2010), spectral curvature clustering (Chen and 



Lerman 2009), and local best-fit flats (Zhang et al. , 2010). The main differences between 



these methods lie either in the way that the entries of the affinity matrix are computed 
in Step 3 or the way in which this matrix is used to obtain an estimate of the underlying 
subspace structures present in the ensemble. In the case of approaches built upon spectral 



clustering (Shi and Malik, 2000), one performs spectral clustering on the subspace afhnity 



matrix for the ensemble in order to cluster the data into different subspaces. In the case of 
consensus-based approaches, one finds a robust estimate of the mode of the local subspace 
estimates formed in Step 2; this problem can also be posed as an optimization on the 
subspace affinity matrix. 

When the subspaces present in the ensemble are linearly separable or non-intersecting, 
local subspace estimates formed from NNs provide relatively reliable and stable estimates 
of the subspaces present in the ensemble. However, neighborhood-based approaches quickly 
fail as the separation between the two structures decreases and as the subspace dimension 
increases relative to the number of points in each subspace. This is due in part to the fact 
that, as the dimension of the intersection between two subspaces increases, the Euclidean 
distance between points becomes a poor predictor of which points belong to the same 
subspace. Thus, we seek an alternative strategy for forming local estimate that does not 
rely solely on whether points in the same subspace live in a local Euclidean neighborhood. 
Instead, our goal is to identify another strategy for "feature selection" that returns sets of 
points (feature sets) that lie along the same subspace. 



1.2 Exact Feature Selection 



Instead of computing local subspace estimates from sets of NNs, Elhamifar and Vidal (2009 ) 



propose a novel approach for feature selection based upon forming sparse representations of 
the data via ^i-minimization. The main intuition underlying their approach is that when 
a sparse representation of a point contained within a union of subspaces is formed with 
respect to the remaining points in the dataset, the representation should only consist of 
other points that belong to the same subspace. When a sparse representation consists of 
points that lie in the same subspace, we say that exact feature selection (EES) occurs. 
Under certain assumptions on both the sampling and "distance between subspaces" this 
approach to feature selection leads to provable guarantees that EES will occur (Elhamifar 



and Vidal, 2010 Soltanolkotabi and Candes, 2012), even when the subspaces intersect. 



We refer to this application of sparse recovery as endogenous sparse recovery due to the 
fact that representations are not formed from an external collection of primitives (such as a 
basis or dictionary) but are formed "from within" the data. Eormally, for a set of d signals 
3^ = {yi, . . . ,yd}, each of dimension n, the sparsest representation of the i^^ point yi is 
defined as 

argmin ||c||o subject to yi = '^c{j)yj, (2) 



where ||c||o counts the number of non-zeroes in its argument. Let A*^*-* = supp(c*) denote 
the subset of points selected to represent the i^^ point and c*(j) denote the contribution 

2. The distance between a pair of subspaces is typically measured with respect to the principal angles 
between the subspaces or other related distances on the Grassmanian manifold. 
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of the j^^ point to the endogenous representation of Ui. By penahzing representations that 
require a large number of non-zero coefficients, the resulting representation will be sparse. 
In general, finding the sparsest representation of a signal has combinatorial complexity, 



and so sparse recovery methods such as basis pursuit (BP) (Chen et al. 



1998 ) or a greedy 



pursuit such as orthogonal matching pursuit (OMP) (Davis et al. , 1994) are employed to 
find approximate solutions. 

1.3 Contributions 



In Elhamifar and Vidal (2010), the authors show that when subspaces are disjoint (inter- 
sect only at the origin) and the minimum principal angle between subspaces is sufficiently 
large, the points selected by BP will belong to the same subspace, i.e., EES is guaranteed. 



Recently, Soltanolkotabi and Candes (2012) developed both deterministic and probabilistic 
guarantees for EES from unions of intersecting subspaces. 

In parallel with recent developments for subspace clustering with BP, in this paper, 
we provide an analysis of endogenous sparse recovery with OMP. The main result of our 
analysis is a new geometric condition (Thm. [T]) for EES that highlights the tradeoff between 
the mutual coherence between points living in different subspaces and the covering radius of 
the points within a common subspace. The covering radius can be interpreted as the radius 
of the largest ball that can be embedded within each normalized subspace cluster without 
touching a point in the ensemble; the vector that lies at the center of this open ball, or the 
vector in the subspace that attains the covering radius is referred to as a deep hole. Thm. [T] 
suggests that subspaces can be arbitrarily close to one another and even intersect, as long 
as the data is distributed "nicely" along each subspace. By "nicely", we mean that the 
points that tile each subspace, attain a small covering radius and do not cluster together, 
leaving large gaps in the subspace. In Eig. [l] we illustrate the covering radius of a set of 
points on the sphere (the deep hole is denoted by a star). 

After introducing a general geometric condition for EES with OMP, we extend this 
analysis to the case where the data live on what we refer to as an uniformly bounded union of 
subspaces. This analysis demonstrates that when the points living in a particular subspace 
are incoherent with the principal vectors that support pairs of subspaces in the ensemble, 
EES can be guaranteed; this is true even when subspaces in the ensemble are intersecting. 
Our condition for bounded subspaces suggests that, in addition to properties related to the 
sampling of the subspace, one can characterize the separability of pairs of subspaces by 
examining the correlation between the data and the unique set of principal vectors that 
support pairs of subspaces in the ensemble. The principal vectors have a straightforward 
interpretation: the principal vectors are the collection of vectors from a pair of subspaces 
that are maximally correlated with one another but also form an orthonormal basis (ONB) 
for each of their respective subspaces. 

In addition to providing a theoretical analysis of EES, another major contribution of 
this work is revealing the "gap" between nearest neighborhood-based (NN) approaches and 
sparse recovery methods such as OMP and BP for feature selection. In particular, we 
demonstrate that sparse recovery methods admit a higher proportion of points that belong 
to the same subspace than sets of NN; this gap in performance is particularly pronounced 
when the number of points in each subspace decreases. These empirical results point to 



4 



Greedy Feature Selection for Subspace Clustering 




Figure 1: Covering radius of points in a normalized subspace. The interior of the antipodal 
convex hull of points in a normalized subspace — a subspace of M" mapped to the 
unit ^2-sphere — is shaded. The vector in the normalized subspace (circle) that 
attains the covering radius or lies within a deep hole in the subspace is denoted 
with a star: when compared with the convex hull, the deep hole coincides with 
the maximal gap between the convex hull and the set of all vectors that live in 
the normalized subspace. 

another advantage of forming sparse representations from within the data; sparse recovery 
methods provide a natural way to reveal the subspace affinity amongst points that might 
be far away from one another in a Euclidean sense. By exploiting non-local relationships 
between points, sparse recovery methods are capable of providing reliable subspace estimates 
with far fewer points than neighborhood-based estimates. 

To illustrate the role that the sampling density has on subspace clustering in real- 
world datasets, we compare the performance of NN and sparse recovery methods (BP and 
OMP) for clustering unlabeled faces that have been captured under different illumination 
conditions — the separation of points that live on different "illumination subspaces". We 
provide empirical evidence that while both NN and sparse recovery methods have compara- 
ble rates of EFS for dense samplings of each subspace, when the subspaces are more sparsely 
sampled, sparse recovery methods provide significant advantages over NN, both in terms of 
EFS and in their final classification rates. Moreover, we demonstrate that in a number of 
settings, OMP outperforms BP in terms of both EFS and clustering performance. See Fig. 
[2] for an example of the affinity matrices formed from pairs of face subspaces for NN (left), 
BP (middle), and OMP (right). 
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Figure 2: Comparison of sub space affinity matrices for illumination subspaces. In each row, 
we display the subspace affinity matrices obtained for a different pair of illumi- 
nation subspaces in the dataset, for NN (left), BP (middle), and OMP (right). 
To the left of the affinity matrices, we display an exemplar image from each 
illumination subspace. 



1.4 Paper Organization 

We now provide a roadmap for the rest of the paper. 



Section 2. We introduce our signal model, the sparse subspace clustering (SSC) algorithm 
introduced in (Elhamifar and Vidal, 2009), and describe how OMP may be used for feature 
selection in subspace clustering. 



Section 3 and 4. We develop the main theoretical results of this paper and provide new 
geometric insights into EPS from unions of subspaces. We introduce sufficient conditions 
for EPS to occur with OMP for general unions of subspaces in Thm. [T| disjoint unions in 
Cor. [T| and uniformly bounded unions in Thm. [3] 
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Section 5. We conduct a number of numerical experiments to validate our theory and 
compare feature selection with sparse recovery methods to NN-based feature selection. 
Experiments are provided for both synthetic and real data. 

Section 6. We discuss the implications of our theoretical analysis and subsequent studies 
on sparse approximation, dictionary learning, and compressive sensing. We conclude with 
a number of interesting open questions and future lines of research. 



Section 7. We supply the proofs of the results contained in Sections 3 and 4. 



1.5 Notation 

In this paper, we will work solely in real finite-dimensional vector spaces, R"". We write 
vectors x in lowercase script, matrices A in uppercase script, and scalar entries of vectors 
as x{j). The standard p-norm is defined as 




where p > \. The £o quasi- norm of a vector x is defined as the number of non-zero ele- 
ments in X. The support of a vector x, often written as supp(x), is the set containing the 
indices of its non-zero coefficients; hence, ||x||o = |supp(x)|. We denote the Moore-Penrose 
pseudoinverse of a matrix A as A'^ . If ^ = UYiV'^ then = VTi'^U'^ , where we obtain S"*" 
by taking the reciprocal of the entries in S, leaving the zeros in their places, and taking 
the transpose. An orthonormal basis (ONB) $ that spans the subspace S of dimension k 
satisfies the following two properties: = and range($) = 5, where 1^ is the k x k 

identity matrix. Let Pa = ^a^a denote an ortho-projector onto the subspace spanned by 
the sub- matrix Xj^. 



2. Greedy Feature Selection for Subspace Clustering 

In this section, we introduce our signal model, detail the sparse subspace clustering (SSC) 
method developed in Elhamifar and Vidal ( 2009 ) , and discuss the utility of greedy methods 
for feature selection in subspace clustering. 



2.1 Signal Model 

Given a set of p subspaces of M", each of dimension ki < k, which we denote as Si, . . . ,Sp, 
we generate "subspace clusters" by sampling di points from each subspace Si . Let denote 
the resulting set of points from subspace Si and let y = U^=^3^j denote the union of these 
p subspace clusters. 

To normalize the points in each subspace cluster or generate a "normalized subspace 
cluster", we map each point in y to the unit sphere. Let 



y 



yi y2 yd 



\yi\\2 \\y2\\2 \\yd\\2 
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denote the resulting set of unit norm points and denote the set of unit norm points that 
he in the span of subspace Si. Let y^i = y \ yi denote the set of points in y, where the 
points in are excluded. 

Let Y = [Yi Y2 ■ ■ ■ Yp] denote the matrix of normalized data, where each point in yi 
is stacked into the columns of Yi € M"^'^». The points in Yi can be expanded in terms of 
an orthonormal basis G M"^'^* that spans Si and subspace coefficients Ai = ^jYi, where 
Yi = ^iAi. Let Y-i denote the matrix containing the points in Y with the submatrix Yi 
excluded. 



2.2 Sparse Subspace Clustering 



In (Elhamifar and Vidal, 2009), the authors propose a novel approach to feature selection 



called sparse subspace clustering (SSC) that employs a relaxation of the ^o-™iiii™ization 
problem in ([2]). To be precise, the sparse subspace clustering (SSC) algorithm proceeds by 
solving the following problem for each point in y: 



arg mm 

eg 



(3) 



|c||i subject to J/i = ^c(j)yj. 

After finding the solution to this £i-minimization problem, each d-dimensional feature vector 
c* is placed into the i*^ row or column of the affinity matrix C and spectral clustering ( Shi 
and Malik, 2000) is performed on the graph Laplacian of = |C| + \C'^\. 



In (Elhamifar and Vidal, 2012), the authors provide an extension of SSC to the case 



where the data might not admit an exact representation with respect to other points in the 
ensemble. In this case, they employ an inequality constrained version of BP known as basis 
pursuit denoising (BPDN) for feature selection; for each point yi, they solve the following 
problem 

c* = arg min ||c||i subject to lll/i — / c{j)yj\\2 < k, (4) 
where k is a parameter that is selected based upon the amount of noise in the data. A 



robust variant of this algorithm was recently proposed in Soltanolkotabi et al. (2013) to 



perform sparse subspace clustering on data living on noisy subspaces. 



2.3 Greedy Feature Selection 

Instead of solving the endogenous sparse recovery problem in ([2]) via £i-minimization, as 
originally proposed in SSC, we will study the behavior of greedy pursuit (OMP) for endoge- 
nous recovery. We detail this procedure in Alg. [l} 

For each point yi, the output of the OMP algorithm is a feature set, A^^\ which indexes 
the columns in Y that are used to form a representation of yi. To use these feature sets 
to cluster the data as in SSC, a d-dimensional sparse feature vector q is computed by 
stacking the A:-dimensional projection q = into the entries of Cj indexed by A(*\ 

where £ M^^" is the pseudoinverse of the submatrix 1^(0 G M"^'^. The feature vector 
Ci is then stacked in the i^^ row of C € W^^'^ and the subspace affinity matrix of the ensemble 
is computed as W = |C| + \C'^\. 

After computing the subspace affinity matrix W with OMP, spectral clustering may be 
performed on the graph Laplacian of W (as in SSC), or a consensus-based method may be 
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Algorithm 1 : Greedy Feature Selection with OMP 

Input: Input signal yj, a matrix containing d data points Y G M"^*^, and a stopping 
criterion. 

Output:A feature set A^*-* containing the indices of all points chosen in the pursuit. 
Initialize: Set the residual to the input signal r = 

1. Select the point that is maximally correlated with the residual and add it to A^*^ 

yV(*) ^ A^*) U arg max|(yj,r)|. 

2. Update the residual by projecting the signal into the space orthogonal to the span of 
the points in A^*^ 

3. Repeat steps (l)-(2) until the stopping criterion is reached, e.g., either |A(*)| = /c or 
the norm of the residual ||r|| < 5. 



employed. In (Dyer 2011), we introduce a consensus-based algorithm for subspace cluster- 
ing and provide a comparison of consensus and clustering-based approaches for subspace 
clustering with greedy feature selection. 

Although OMP is known to be suboptimal for standard applications of sparse signal 
recovery, our results suggest that greedy pursuits provide a powerful alternative to convex 
optimization based approaches such as BP and BPDN. An obvious advantage of using greedy 
methods is that they exhibit reduced computational complexity when compared to convex 
optimization-based approaches. In addition, we observe that OMP recovers representations 
that "fill in" subspace affinity matrices more uniformly than BP. This in turn results in 
affinity matrices that are easier to segment — as measured by the ratio of the sum of the 
edges within a cluster to the sum of the edges across different clusters. See Fig. [2] for an 
example of the affinity matrices obtained via OMP, BP, and NN for collections of images of 
faces under different lighting conditions in the Yale B database (Georghiades et al. [2001 ). 



In Section 5.4, we provide empirical evidence that sparse recovery methods (i.e., OMP 
and BP) provide significant improvements over NN in terms of EFS and the clustering 
performance of the three methods on real and synthetic data. These empirical findings, 
coupled with the fact that OMP exhibits reduced computational complexity, suggest that 
OMP provide significant advantages over BP in spectral clustering-based approaches to 
subspace clustering. 



3. Exact Feature Selection from Unions of Subspaces 

In this section, we provide a formal definition of EFS and develop sufficient conditions that 
guarantee that EFS will occur for all of the points contained within a particular subspace 
cluster. 
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3.1 Exact Feature Selection 

As we alluded to in the Introduction, in order to guarantee that OMP returns a sample 
set that yields accurate local subspace estimates, we will be interested in determining when 
the feature set A^*) returned by Alg. [l] only contains other points that belong to the same 
subspace. When the feature set only contains points that live in the same subspace, we say 
that exact feature selection (EFS) occurs. We now supply a formal definition. 

Definition 1 (Exact feature selection) Let yk = {y '■ {I — Pk)y = 0, y G 3^} index the 
set of points in y that live in the span of subspace Sk, where Pt is a projector onto the span 
of subspace S^. For a point yi € y^ with feature set A^*), we say that A^'^^ contains exact 
features if yj C y,,, Vj G A^*). 

EFS is essential for studying the performance of algorithms for unsupervised subspace 
learning problems, because when EFS occurs for a point in the set, this will yield a subspace 
estimate that coincides with one of the true subspaces contained within the data. For this 
reason, EFS provides a natural condition for studying the performance of both subspace 
consensus and spectral clustering methods. Note that this definition allows for points that 
lie in the intersection between two planes to belong to either subspace. Thus, EFS can still 
be satisfied for points at an intersection granted they admit feature sets that are restricted 
to a single subspace cluster. 

3.2 Geometric Conditions for EFS 

We will now develop our main results which provide sufficient conditions for EFS to occur 
for all points in a particular subspace cluster. 

3.2.1 Preliminaries 

Our main result in Thm. [T] below requires measures of both the distance between points 
in different subspace clusters and in the same subspace cluster. A natural measure of the 
similarity between points living in different subspaces is the mutual coherence. A formal 
definition of the mutual coherence is provided below in Def. [2j 

Definition 2 The mutual coherence between the points in the sets (3^j,3^j) is defined as 

fJ^ciyijyj) = max \{u,v)\, where ||n||2 = ||w||2 = 1- (5) 

In words, the mutual coherence provides a point- wise measure of the normalized inner 
product (coherence) between all pairs of points that lie in two different subspace clusters. 

Let fic{yi) denote the maximum mutual coherence between the points in yi and all other 
subspace clusters in the ensemble, where 

Hc{yi) = max Hc(yi,yj)- 

A related quantity that provides an upper bound on the mutual coherence between 
points in two subspaces is the cosine of the first principal angle between the subspaces. The 
first principal angle 0*- between subspaces Si and Sj of dimension ki and kj respectively, is 
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the smallest angle between a pair of unit vectors {ui , vi ) drawn from Si x Sj . Formally, the 
first principal angle is defined as 

6*j = min arccos {u,v) subject to ||m||2 = 1, ||f II2 = 1- (6) 

Whereas the mutual coherence provides a measure of the pairwise similarity between a 
finite set of unit norm vectors that live in different subspaces, the cosine of the minimum 
principal angle provides a measure of the pairwise similarity between any pair of unit norm 
vectors that lie in the span of Si x Sj. For this reason, the cosine of the first principal angle 
provides an upper bound on the mutual coherence. The following upper bound is in effect 
for each pair of subspace clusters in the ensemble: 

^^c{y^,yJ)<cos{e*^). (7) 

To measure the degree to which the points in the same subspace cluster provide a tiling 
of their span, we will study the covering radius of each subspace cluster relative to the 
projective distance. Formally, the covering radius of the set is defined as 

cover(3^fc) = max min dist(w,y), (8) 

u&Sk y&Vk 

where the projective distance between two vectors u and y is defined relative to the acute 
angle between the vectors 

distKy) = ./l-^^j4r- 

The covering radius of the normalized subspace cluster can be interpreted as the size of 
the largest open ball that can be placed in the set of all unit norm vectors that lie in the 
span of Si, without touching a point in 3^,. 

Let {u*,y*) denote a pair of vectors that attain the maximum covering diameter for 3^,, 
where u* S Si lies in a deep hole in along Si. The covering radius can be interpreted as 
the sine of the angle between the deep hole u* and its nearest neighbor y* G J^j. We show 
the geometry underlying the covering radius in Figure [T| 

In the sequel, we will be interested in the maximum (worst-case) covering attained over 
all di sets formed by removing a single point from yi. We supply a formal definition in Def. 

El 

Definition 3 The maximum covering diameter e of the set y^ along the subspace Si is 
defined as 

e = max 2 cover({3^j \ yj}). 

j=l,...,di 

Hence, the covering radius equals e/2. 

A related quantity is the inradius of the set yi, or the cosine of the angle between a 
point in and any point in Si that attain the covering radius. The relationship between 
the covering diameter e and inradius r(3^j) is given by 



r{yi) = ^l-j. (10) 
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A geometric interpretation of the inradius is that it measures the distance from the origin to 
the maximal gap in the antipodal convex hull of the points in 3^j. The geometry underlying 
the covering radius and the inradius is displayed in Figure [T] 

3.2.2 General Result for EFS 

We are now equipped to state our main geometric result for EFS with OMP. The proof is 
contained in Section [7. 1[ 



Theorem 1 Let e denote the maximal covering diameter of the subspace cluster yi as 
defined in Def. A sufficient condition for EFS to occur for all points in yi is that the 
mutual coherence 



e2 e 



< Vl---^maxcos(^^), (11) 



where 



9*j is the minimum principal angle defined in 



In words, this condition requires that the mutual coherence between points in different 
subspaces is less than the difference of two terms that both depend on the covering radius 



of points along a single subspace. The first term on the RHS of (11) is the inradius, as 



defined in (10); the inradius provides a measure of the coherence between two points that 



attain the covering radius of the subspace cluster. The second term on the RHS of (11) is 
the product of the cosine of the minimum principal angle between pairs of subspaces in the 
ensemble and the covering diameter e of the points in y^. 

When subspaces in the ensemble intersect, i.e., cos{6*j) = 1, condition (11) in Thm. [T] 
can be simplified to 



Mc(3'.)<Vl-T-yi5-V'-4-i:86- <'^' 

In this case, EFS can be guaranteed as long as the points in different subspace clusters are 
bounded away from intersections between subspaces. When the covering radius shrinks to 
zero (full sampling of subspace), Thm. [T] requires that /Xc < 1, or that points from different 
subspaces do not lie exactly in the subspace intersection, i.e., are identifiable from one 
another. 

3.2.3 Geometry underlying EFS 

The main idea underlying the proof of Thm. [T] is that, at each iteration of Alg. [T| we require 
that the residual used to select a point to be included in the representation is closer to a 
point in the correct subspace cluster (X) than a point in an incorrect subspace cluster (3^-i). 
To be precise, we require that the normalized inner product of the residual r and all points 
outside of the correct subspace cluster 

max < r{y,), (13) 



at each iteration of Alg. [T| To provide the result in Thm. [T| we require that ( 13 ) holds for 
all possible residuals, or for all r £ Si. 
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(b) 



Figure 3: Geometry underlying EFS. A union of two disjoint subspaces of different dimen- 
sion: the antipodal convex hull of a set of points (red circles) living on a 2D sub- 
space is shaded (green). In (a), we show an example where EFS is guaranteed — 
the projection of points along the ID subspace lie inside the shaded convex hull of 
points in the plane. In (b), we show an example where EFS is not guaranteed — 
the projection of points along the ID subspace lie outside the shaded convex 
hull. 

A geometric interpretation of the EFS condition in Thm. [l] is that the projection of 
points in an incorrect subspace cluster onto the subspace spanned by the correct subspace 
cluster must lie within the antipodal convex hull of the points in the correct subspace 
cluster. To see this, consider the projection of the points in incorrect subspace clusters onto 
subspace Si. Let z* denote the point on subspace Si that is closest to the signal yj G y~i, 



Za = arg mm 



We can also write this projection in terms of a orthogonal projection operator Pi = 
where $j is an ONB that spans Si and z* = Piyj. 

By definition, the normalized inner product of the residual with points in incorrect 
subspace clusters is upper bounded as 



max 



< max 



\{zhyj)\ 



max cosZ{z*,y,} 



(14) 



Thus to guarantee EFS, we require that the cosine of the angle between all signals in 
and their projection onto Si is less than the inradius of 3^j. Said another way, the EFS 
condition requires that the length of all projected points be less than the inradius of 3^j. 

In Fig. [Sj we provide a geometric visualization of the EFS condition for a union of 
disjoint subspaces (union of a ID subspace with a 2D subspace). In (a), we show a case 
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where EFS is guaranteed because the projection of points in the incorrect subspace (hue) 
onto the correct subspace (plane) Ues well within the antipodal convex hull of the points 
in the plane. In (b), we show a case where EFS is not guaranteed because the projection 
of points in the incorrect subspace (line) onto the correct subspace (plane) lies outside the 
antipodal convex hull of the points in the plane. 



3.2.4 EFS FOR Disjoint Subspaces 



When the subspaces in the ensemble are disjoint, i.e., cos{9*,) < 1, Thm.fTlcan be simplified 
further by using the bound for the mutual coherence in Inn. This simpmication results in 
the following corollary. 



denote the first principal angle between disjoint subspaces Si and Sj, 



Corollary 1 Let 

and let e denote the maximal covering diameter of the points in y., 
for EFS to occur for all points in yi is that 



max cos(^*,) 



< 



1 + e/^' 



A sufficient condition 



(15) 



3.3 Connections to Previous Work 

In this section, we will connect our results for OMP with previous analyses of EFS with BP 



provided in (Elhamifar and Vidal, 2010 Soltanolkotabi and Candes, 2012) for disjoint and 



intersecting subspaces respectively. Following this, we will contrast the geometry underlying 
EFS with exact recovery conditions used in standard applications of sparse recovery methods 



(Tropp, 2004 2006). 



3.3.1 Subspace Clustering with BP 



In (Elhamifar and Vidal, 2010), the authors develop the following sufficient condition for 



EFS to occur for BP from a union of disjoint subspaces. 



max cos (6** ) 



< 



max 



(16) 



where Wj is the set of all full rank sub-matrices Yi G M"^'^» of the data matrix Yi G M"-^*^* 
and cr^[^{Yi) is the minimum singular value of the sub-matrix Yi. Since we assume that 
all of the data points have been normalized, crmin(^i) < 1; thus, the best case result that 
can be obtained is that the minimum principal angle, cos{0*j) < 1/^/hi. This suggests that 
the minimum principal angle of the union must go to zero, i.e., the union must consist of 
orthogonal subspaces, as the subspace dimension increases. 

In contrast to the result in (16), our result for disjoint subspaces in Cor. [l] does not 
depend on the subspace dimension. Rather, our result requires that there are enough 
points in each subspace to achieve a sufficiently small covering; in which case, EFS can be 
guaranteed for disjoint subspaces of any dimension. 



In (Soltanolkotabi and Candes, 2012), the authors develop the following sufficient con- 



dition for EFS with BP from unions of intersecting subspaces: 



ij'v{yi) 



max 



(17) 
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where the matrix y^*-* G K^^i^" contains the dual directions (dual vector of each point in 
embedded in M") in its columns and r(3^j) is the inradius as defined in (10). In words, 



( 17) requires that the maximum coherence between any point in 3^_j and the dual directions 



contained in V^'^'^ must be less than the inradius of the points in 

Without going into the details of the interpretation of the dual directions V^'^\ the 



relationship between the coherence measure Hviyi) provided in ( Soltanolkotabi and Candes 



2012) and the mutual coherence /Xc(3^j) used in our analysis, is not apparent. However, when 



the points in each subspace cluster are distributed uniformly and at random along each 
subspace, the dual directions will also be distributed uniformly along each subspacej^ In 
this (yi) win be roughly equivalent to the mutual coherence. When iic{yi) ~ 

the result in ([Tt]) can be expressed in terms of the mutual coherence as //c(3^i) < f{yi)- 



This simplification reveals the connection between the result in (Soltanolkotabi and 



Candes, 2012) for BP and the condition in Thm. [T]for OMP. In particular, our result for 
OMP requires that the mutual coherence is smaller than the inradius minus an additional 
term that depends on the product of the minimum principal angle between subspaces and a 
term that is linear in the covering diameter e. Thus, the deterministic guarantee provided in 



(Soltanolkotabi and Candes, 2012) suggests that EFS may be guaranteed for BP for larger 
covering diameters than those required for OMP. However, when the covering diameter 
e — (as the subspace becomes more densely sampled), the guarantees for BP and OMP 
are roughly equivalent ( assuming that /x-^ ^ /^c)* 

While our current results for OMP are slightly more restrictive than those obtained for 
BP, since EFS is only sufficient and not necessary for subspace recovery, we maintain that 
OMP offers a powerful alternative to BP for endogenous sparse recovery. This is due in 
part to the reduced computational complexity of OMP; in addition, we provide empirical 
evidence that demonstrates that OMP outperforms BP both in terms of EFS and with 
respect to the final clustering performance of each method on real data. Our empirical 
findings suggest that our worst-case results for OMP are quite pessimistic and that EFS 
occurs in practice for even sparser samplings of subspaces in the ensemble and/or higher 
degrees of overlap between subspaces (larger mutual coherence) . 



3.3.2 Exact Recovery Conditions for Sparse Recovery 

To provide further intuition about EFS in endogenous sparse recovery, we will compare the 
geometry underlying the EFS condition with the geometry of the exact recovery condition 



(ERC) for sparse signal recovery in (Tropp 2004, 2006). 



In standard analyses of sparse recovery algorithms, one assumes that a signal y € M" 
has been synthesized from a linear combination of atoms from a particular sub-dictionary 
$A £ M"^'^. To guarantee exact support recovery for y, we must ensure that we can uniquely 
recover an approximation of y that consists solely of atoms from $a- To guarantee that 



a representation does not contain any atoms outside of the correct support set A, Tropp 



(2004) introduced a general exact recovery condition (ERC) that is sufficient for support 
recovery using both BP and OMP. 



Soltanolkotabi and Candes 



( 2012 1 for a formal definition of the dual directions and insight 



3. SeeDef. 2.2 in 
into the geometry underlying their guarantees for EFS via BP. 

4. This approximation is based upon personal correspondence with M. Soltankotabi, one of the authors 



that developed the results for EFS with BP in Soltanolkotabi and Candes (20121. 
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Theorem 2 (Tropp, 2004) For any signal supported over the sub- dictionary <I>a, exact 



support recovery is guaranteed for both OMP and BP if 

ERC(A) = max ||$a Villi < 1- (18) 

i^A 

A geometric interpretation of the ERC is that it provides a measure of how far a projected 
atom outside of the optimal set A lies from the antipodal convex hull of the atoms in A. 
When a projected atom lies outside of the antipodal convex hull formed by the set of points 
in the sub-dictionary <I>a, then the ERC condition is violated and support recovery is not 
guaranteed. For this reason, the ERC requires that the maximum coherence between the 
columns of $ is sufficiently low or that is incoherent. 

While the ERC condition requires a global incoherence property on all of the columns of 
<I>, the EES condition requires a local incoherence property. In particular, the EES condition 
requires that the projection of atoms in an incorrect subspace cluster 3^_j onto Si must be 
incoherent with deep holes in along Si. In addition, we need that the points within a 
subspace cluster are as coherent as possible in order to produce a small covering radius. 



4. EFS for Uniformly Bounded Unions of Subspaces 

In this section, we study the connection between EES and the higher-order principal angles 
(beyond the minimum angle) between pairs of intersecting subspaces. 



4.1 Subspace Distances 

To characterize the "distance" between pairs of subspaces in the ensemble, the principal 
angles between subspaces will prove useful. As we saw in the previous section, the first 
principal angle 6*j between subspaces Si and Sj of dimension ki and kj is defined as the 
smallest angle between a pair of unit vectors {ui,vi) drawn from Si x Sj. The vector pair 
(n^,fj|') that attains this minimum is referred to as the first set of principal vectors. The 
second principal angle is defined much like the first, except that the second set of principal 
vectors that define the second principal angle are required to be orthogonal to the first set of 
principal vectors (u^ , ) • The remaining principal angles are defined recursively in this way. 
The sequence of k = mm{ki,kj) principal angles, ^ ^ " " " ^ ^fc-i) is non-decreasing 
and all of the principal angles lie between [0, tt/2]. 

The definition above yields some insight into what the principal angles/vectors tell us 
about the geometry underlying a pair of subspaces; in practice, however, the principal 
angles are not computed in this recursive manner. Rather, a computationally efficient way 
to compute the principal angles between two subspaces is to first compute the singular 
values of the matrix G = ^J^j, where $i G ]^"xfci an ONB that spans subspace Si. We 
will refer to this set of singular values aij G [0, 1]^ as the cross-spectra of the subspace pair 
{Si,Sj), where k = min(fcj, kj) is the minimum dimension of the pair of subspaces of interest. 
The m}^ smallest principal angle is related to the m}^ largest value of the cross-spectra via 
the following relationship, cos(^jj(m)) = aij{m), where aij{m) is the m}^ largest entry of 
the cross-spectra. 

A pair of subspaces is said to be disjoint if the minimum principal angle is greater 
than zero. Non-disjoint or intersecting subspaces are defined as subspaces with minimum 
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principal angle equal to zero. The dimension of the intersection between two subspaces 
is equivalent to the number of entries of the cross-spectra that are equal to one. We 
also consider a slightly weaker notion of intersection that we call the overlap between two 
subspaces. The overlap q is defined as the rank(G) or equivalently, q = \\aij\\o. Thus, 
the overlap is greater than or equal to the dimension of the intersection between a pair of 
subspaces, q > dim(5j n Sj). 

4.2 Sufficient Conditions for EFS 

The sufficient condition for EFS from disjoint subspaces in Cor. [T] reveals an interesting 
relationship between the covering radius and the minimum principal angle between pairs 
of subspaces in the ensemble. However, we have yet to reveal any dependence between 
EFS and higher-order principal angles. To make this connection more apparent, we will 
make additional assumptions about the distribution of points in the ensemble, namely that 
pairs of subspaces in the ensemble are uniformly bounded relative to the principal vectors 
supporting each pair of subspaces. The principal vectors can be interpreted as a pair of 
ONBs in M"" that span Si and 52 and are maximally coherent with one another. 

To make this precise, consider the SVD of the matrix G = ^J^j = UT^V"'" , where 

are arbitrary ONBs that span subspaces Si and Sj , respectively. 
Without loss of generality, assume that the dimension of Si is greater than or equal to the 
dimension of Sj or that ki > kj. Let U = ^iU and V = ^jV denote the set of left and right 
principal vectors respectively, and y = [1^ Yj] be the matrix of data points, where Yi and 
Yj contain the points in Si and Sj, respectively. 

When the points in each subspace are incoherent with the principal vectors in the 
columns of U and V, we say that the ensemble Y is an uniformly bounded union of subspaces. 
Formally, we require the following incoherence property holds: 



Y/U\U\\YfV\\^) <j, (19) 

where || • ||oo is the entry-wise maximum and 7 E (0,1]. This property requires that the 
inner products between the principal vectors that span a particular subspace and the points 
in the subspace are all bounded by a fixed constant. In other words, for a union to have 
a small bounding constant, each point must be have a dense (uniform) distribution with 
respect to the the principal vectors that provide an ONE for the subspace. 



When the points in each subspace are distributed such that (19) holds, we can rewrite 



the mutual coherence between any two points from different subspaces to reveal its depen- 



dence on higher-order principal angles. In particular, we show (in Section 7.2) that the 
coherence between the residual which is used to select the next point to be included in the 
representation, and a point in yj at a particular iteration of the Alg. [T] is upper bounded 
by 

\(r v)\ 

max ' , ',f^' < 7||o-r;||i, (20) 
y&Vj \\rh 

where 7 is the bounding constant, aij is the cross-spectra of the subspaces Si and Sj, 
and llfTjj Ili is the £i-norm of the cross-spectra or equivalently, the trace norm of G = ^f^j. 



Using the bound in ( 20 ) , we arrive at the following sufficient condition for EFS for uniformly 



bounded unions of subspaces. 
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Theorem 3 Let Y he a uniformly hounded union of suhspaces as defined in (19), where 
r = rank(G), and ^ < \J\Jr ■ Let aij denote the cross-spectra of the suhspaces Si andSj. A 
sufficient condition for EFS to occur for all of the points in is that the covering diameter 



e < min \ I- T^lkiilli- 



In words, this condition requires that the covering diameter of each subspace and the bound- 
ing constant of the union must both be sufficiently small in order to guarantee EFS. In order 
for an ensemble to have a small bounding constant, each point must have a balanced repre- 
sentation with respect to each the principal vectors that span their respective subspaces. To 
be precise, if we look at the coefficients /3 resulting from projecting each point y £ Si onto 
the principal vectors than span Si, where (3 = U^y, then to reduce the bounding constant, 
we need the energy in f3 to be spread as much as possible. 

Our worst-case analysis assumes that the nonzero entries of the cross-spectra are equal, 
and thus each pair of supporting principal vectors are equally important in determining 
whether the ensemble will admit EFS. However, this assumption is not true in general. 
When the union is supported by principal vectors with non-uniform principal angles, our 
analysis suggests that a weaker form of incoherence is required. Instead of requiring inco- 
herence with all principal vectors, the data must simply be incoherent with the principal 
vectors that correspond to small principal angles. This means that as long as points are 
not concentrated along the principal directions with small principal angles (i.e., intersec- 
tions), then EFS can be guaranteed, even when subspaces exhibit non-trivial intersections. 



To test this prediction, we will study a bounded energy model for the data in Section 5.2 
We show that when the dataset is sparsely sampled (larger covering radius), reducing the 
amount of energy that points contain in the intersections between two subspaces, increases 
the probability that points admit EFS dramatically. 

Finally, our analysis of bounded unions suggests that the decay of the cross-spectra is 
likely to play an important role in determining whether points will admit EFS or not. More- 
over, unions with different cross-spectral decay properties are likely to behave differently 
in terms of their respective probability of admitting EFS. To test this hypothesis, we will 
study the role that the structure of the cross-spectra plays in EFS in Section |5.3[ 

5. Experimental Results 

In our theoretical analysis of EFS in Sections [3] and [4j we revealed an intimate connection 
between the covering radius of subspaces and the distribution of principal angles between 
pairs of subspaces in the ensemble. In this section, we will conduct an empirical study to 
explore these connections further. In particular, we will study the probability of EFS as we 
vary the covering radius as well as the dimension of the intersection and/or overlap between 
subspaces. In addition, we will study the role that the: (i) structure of the cross-spectra 
and (ii) amount of energy points have in the intersection between subspaces, have on EFS 
as predicted in our discussion of Thm. [3j 
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5.1 Generative Model for Synthetic Data 

In order to study EFS for unions of subspaces with varied cross-spectra, we will generate 
data from unions of overlapping block-sparse signals. 

5.1.1 Constructing Sub-dictionaries 

We construct a pair of sub-dictionaries as follows: Take two subsets f^i and 0,2 of k atoms 
from a dictionary D of d n-dimensional vectors (atoms) {dm}m=i' where dm G and 
l^^il = 1^2! = k. Let ^' G W^^^ denote the subset of atoms indexed by the first set 



= Dfi-^, and let $ G W^^ denote the subset of atoms indexed by the second set ^ = Dq^. 
Our goal is to select and $ such that G = "^^^ is diagonal, i.e., {il>i,4>j) = 0, if i 7^ j, 
where V'i is the i^^ element in ^ and (pj is the j**^ element of <I>. In this case, the cross-spectra 
is defined as o" = diag(G), where a G [0, 1]'^. For each union, we fix the "overlap" or the 
rank of G = ^^'^ to a constant between zero (orthogonal) and k (maximal overlap), where 
the overlap q G [0, k) and 6 = q/k is the overlap ratio between a pair of subspaces. 

To generate a pair of /c-dimensional subspaces with a g-dimensional overlap, we can pair 
the elements from ^' and <I> such that the i^^ entry of the cross-spectra equals 



aii) 



|(V'i,0j)| if 1 < i < 

if i = q + 1 < i < k. 



We can leverage the banded structure of shift-invariant dictionaries, e.g., dictionary 
matrices with localized Toeplitz structure, to generate subspaces with structured cross- 
spectra as follows First, we fix a set of k incoherent (orthogonal) atoms from our shift- 
invariant dictionary, which we place in the columns of ^. Now, holding ^ fixed, we set 
the i*^ atom (pi of the second sub-dictionary <I> to be a shifted version of the i^^ atom Tpi 
of the dictionary To be precise, if we set ipi = d^, where dm is the m^^ atom in our 
shift-invariant dictionary, then we will set (pi = dm+A for a particular shift A. By varying 
the shift A, we can easily control the coherence between tpi and ipi. In Fig. [4j we show 
an example of one such construction for k = q = 5. Since a G (0, 1]*^, the worst-case 
pair of subspaces with overlap equal to q is obtained when we pair q identical atoms with 
k — q orthogonal atoms. In this case, the cross-spectra attains its maximum over its entire 
support and equals zero otherwise. For such unions, the overlap q equals the dimension of 
the intersection between the subspaces. We will refer to this class of block-sparse signals as 
orthohlock sparse signals. 

5.1.2 Coefficient Synthesis 

To synthesize a point that lives in the span of the sub-dictionary ^ G M"^'^, we combine 
the elements {V'l, • . . , ipk} and subspace coefficients {a(l), . . • , a{k)} linearly to form 

k 



5. While shift-invariant dictionaries appear in a wide range of appHcations of sparse recovery (Mailhe et al 



2008 Dyer et al. 20101, we introduce the idea of using shift-invariant dictionaries to create structured 



unions of subspaces for the first time here. 
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Figure 4: Generating unions of subspaces from shift-invariant dictionaries. An example of 
a collection of two sub-dictionaries of five atoms, where each of the atoms have 
a non-zero inner product with one other atom. This choice of sub-dictionaries 
produces a union of disjoint subspaces, where the overlap ratio 6 = q/k = 1. 



where a(j) is the subspace coefficient associated with the j^^ column in ^. Without loss 
of generality, we will assume that the elements in ^ are sorted such that the values of 
the cross-spectra are monotonically decreasing. Let yf = Yl'j=i''Pj'^iU) "common 
component" of Ui that lies in the intersection between and let yf = Yl'j=q+i''l'j'^ij) 

denote the "disjoint component" of yi that lies in the space orthogonal to the intersection 
between the subspaces. 

For our experiments, we consider points drawn from one of the two following coefficient 
distributions. 

1. Uniformly Distributed on the Sphere: Generate subspace coefficients according to a 
standard normal distribution and map the resulting point to the unit sphere 

^ • tpja{j) 
II Lj V'jaUjlb 

2. Bounded Energy Model: Generate subspace coefficients uniformly on the sphere and 
rescale each coefficient in order to bound the energy in the common component 



-^Vi + (1 - 



\yf\\2 



By restricting the total energy that each point has in its common component, the bounded 
energy model can be used to produce ensembles with small bounding constant to test the 
predictions in Thm. [3} 

5.2 Phase Transitions for EES 

The goal of our first experiment is to study the probability of EFS — the probability that a 
point in the ensemble admits exact features — as we vary both the number and distribution 
of points in each subspace as well as the dimension of the intersection between subspaces. 
For this set of experiments, we generate a union of orthoblock sparse signals, where the 
overlap equals the dimension of the intersection. 

Along the top row of Fig. [5| we display the probability of EFS for points drawn from 
a uniform distribution on the sphere: the probability of EFS is computed as we vary the 
overlap ratio 5 = q/k G [0, 1] in conjunction with the oversampling ratio p = k/ d & [0, 1], 
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k = 20 k = 50 




s s 

Figure 5: Probability of EPS for different coefficient distributions. The probability of EFS 
for a union of two subspaces of dimension A; = 20 (left column) and A; = 50 (right 
column). The probability of EFS is displayed as a function of the overlap ratio 
6 G [0, 1) and the logarithm of the oversampling ratio log(p) (top row) and the 
mutual energy r = Hydb (bottom row) . 



where q = rank($^$2) is equal to the dimension of the intersection between the subspaces, 
and d is the number of points per subspace. Along the bottom row of Fig. [5} we display 
the probability of EFS for points drawn from a bounded energy model: the probability of 
EFS is computed as we vary the overlap ratio 5 and the mutual energy r G [0, 1). For these 
experiments, the subspace dimension is set to A; = 20 (left) and A; = 50 (right). To see the 
phase boundary that arises when we approach critical sampling (i.e., p ~ 1), we display 
our results in terms of the logarithm of the oversampling ratio. For these experiments, the 
results are averaged over 500 trials. 

As our theory predicts, the oversampling ratio has a strong impact on the degree of 
overlap between subspaces that can be tolerated before EFS no longer occurs. In partic- 
ular, as the number of points in each subspace increases (covering radius decreases), the 
probability of EFS obeys a second-order phase transition, i.e., there is a graceful degrada- 
tion in the probability of EFS as the dimension of the intersection increases. When the 
pair of subspaces are densely sampled, the phase boundary is shifted all the way io 5 = 0.7. 
This is due to the fact that as each subspace is sampled more densely, the covering radius 
becomes sufficiently small to ensure that even when the overlap between planes is high, EFS 
still occurs with high probability. In contrast, when the subspaces are critically sampled, 
i.e., the number of points per subspace d ~ A;, only a small amount of overlap can be toler- 
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Figure 6: Probability of EFS for unions with structured cross-spectra. Along the top row, 
we show the cross-spectra for different unions of bfock-sparse signals. Along the 
bottom row, we show the probability of EFS as we vary the overlap ratio 5 G [0, 1] 
for OMP (sohd) and NN (dash). 



ated, where 6 < 0.1. In addition to shifting the phase boundary, as the oversampling ratio 
increases, the width of the transition region increases. 

Along the bottom row of Fig. [5j we study the impact of bounding the mutual energy 



of points in the ensemble as discussed in Section 4.2, In this experiment, we fix the over- 
sampling ratio to p = 0.1 and vary the mutual energy r in conjunction with the dimension 
of the intersection. By reducing the mutual energy of the union, the phase boundary for 
the uniformly distributed data in the top row is shifted from 6 = 0.45 to 6 = 0.7 for both 
k = 20 and k = 50. This result confirms our predictions in the discussion of Thm. [3]that 
by reducing the amount of energy that points have in their subspace intersections EFS will 
occur for higher degrees of overlap. Another interesting finding is that, once the mutual 
energy r reaches a threshold, the phase boundary remains constant and reducing the energy 
further has no impact on the phase transitions for EFS. 

5.3 Comparison of Feature Selection with OMP and NN 

In this section, we compare the probability of EFS via OMP with the feature sets ob- 
tained via nearest neighbors (NN). First, we compare the performance of these methods for 
unions with varying cross-spectra. Second, we compare the phase transitions for unions of 
orthoblock sparse signals as we vary the overlap and number of points per subspace. 

For our experiments, we generate pairs of subspaces with structured cross-spectra as 
described in Section 5.1.1[ The cross-spectra arising from three different unions of block- 



sparse signals are displayed along the top row of Fig. |6j On the left, we show the cross- 
spectra for a union of overlapping orthoblock sparse signals with overlap ratio 6 = 0.75, 
where g = 15 and k = 20. The cross-spectra obtained from pairing shifted Lorentzian and 
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exponential atoms are displayed in the middle and right columns, respectively. In Fig.[6j we 
show the probability of EFS for OMP and NN for each of these three subspace unions as we 
vary the overlap q. To do this, we generate subspaces by setting their cross-spectra equal to 
the first q entries equal to the cross-spectra in Fig. [6] and setting the remaining k — q entries 
of the cross-spectra equal to zero. Each subspace cluster is generated by sampling d = 100 
points uniformly and at random along each normalized subspace cluster, as described in 
Section 15.1.21 

The gap between the probability of EFS for OMP and NN is markedly different for 
each of the three unions. In the first union of orthoblock sparse signals, the probability of 
EFS for OMP lies strictly above that obtained for the NN method, but the gap between 
the performance of both methods is relatively small. In the second union, both methods 
maintain a high probability of EFS, with OMP admitting nearly perfect feature sets even 
when the overlap ratio is maximal. In the third union, we observe that the gap between 
EFS for OMP and NN is most pronounced. In this case, the probability of EFS for NN sets 
decreases to 0.1, while OMP admits a very high probability of EFS, even when the overlap 
ratio is maximal. 

This study provides a number of interesting insights into the role that higher-order 
principal angles between subspaces play in feature selection for both sparse recovery methods 
and NN. In particular, we observe that when data is distributed uniformly with respect to 
all of the principal directions between a pair of subspaces and the cross-spectra is sub- 
linear, then EFS may be guaranteed with high probability for all points in the set provided 
the sampling density is sufficiently high. This is in agreement with the discussion of EFS 
bounded unions in Section |4.2[ Moreover, these results further support our claims that 
in order to truly understand and predict the behavior of endogenous sparse recovery from 
unions of subspaces, we require a description that relies on the entire cross-spectra. 

In Fig. [Tj we display the probability of EFS for OMP (left) and sets of NN (right) as 
we vary the overlap and the oversampling ratio. For this experiment, we consider unions of 
orthoblock sparse signals living on subspaces of dimension = 50 and vary p E [0.2, 0.96] 
and 5 G [l//c, 1]. An interesting result of this study is that there are regimes where the 
probability of EFS equals zero for NN but occurs for OMP with a non-trivial probability. 
In particular, we find that, when the sampling of each subspace is sparse, the gap between 
OMP and NN increases and OMP significantly outperforms NN in terms of their probability 
of EFS. Our study of EFS for structured cross-spectra suggests that the gap between EFS 
for NN and OMP will be even more pronounced for cross-spectra with superlinear decay. 



5.4 Face Illumination Subspaces 



In this section, we compare the performance of sparse recovery methods, i.e., BP and OMP, 
with NN selection for unions of illumination subspaces arising from a collection of images 
of faces under different lighting conditions. By fixing the camera center and position of the 
persons face and capturing multiple images under different lighting conditions, the resulting 



images can be well-approximated by a 5-dimensional subspace ( Ramamoorthi , 2002). 



In Fig. [2j we show three examples of the subspace affinity matrices obtained with NN, 
BP, and OMP for a collection of two different faces under 64 different illumination conditions 



from the Yale Database B (Georghiades et al. , 2001 ), where each image has been subsampled 
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Figure 7: Phase transitions for OMP and NN. The probability of EFS for orthoblock sparse 
signals for OMP (a) and NN (b) feature sets as a function of the oversampling 
ratio p = k/d and the overlap ratio 5 = q/k, where A; = 20. 

to 48 X 42 pixels, with n = 2016. In all of the examples, the data is sorted such that the 
images for each face are placed in a contiguous block. 

To generate the NN affinity matrices in the left column, we compute the normalized 
inner products between all points in the dataset and then threshold each row to select the 
k = 5 nearest neighbors to each point. To generate the affinity matrices for both OMP 
and BP, we compute the sparse representations of each point in the dataset with Alg. 
[T] with k = 5 and by solving the BP denoising (BPDN) problem in Q with k = 0.05, 
respectively. The resulting coefficient vectors are placed into the rows of a matrix C and 
then the final subspace affinity W is computed by symmetrizing the coefficient matrix, 
W = |C| + |C"^|. To compute the BP affinity matrix, we threshold each coefficient vector 
c* and select the top k = 5 coefficients (in absolute magnitude) and set the remaining 
coefficients to zero. We solved BPDN for different values of the noise tolerance parameter 
K and included the estimate that yielded the best clustering performance. In addition, we 
computed the subspace affinity matrix using the full coefficient vectors (not thresholded) 
obtained via BPDN and with the least-squares estimate for top k coefficients from the 
thresholded BP coefficients, both of which produced worse clustering results. 

In Table [Tl we display the percentage of points that resulted in EFS and the final 
clustering results for all pairs of ( ^ ) subspaces in the Yale B database, where we use the 
normalized graph cuts algorithm in ( |Shi and Malik 2000 ) on the subspace affinity matrix for 



each method. Along the top row, we display the mean and median percentage of points that 
resulted in EFS for the full dataset (all 64 illumination conditions), half of the dataset (32 
illumination conditions selected at random in each trial), and a quarter of the dataset (16 
illumination conditions selected at random in each trial). While sparse recovery methods 
admit EFS rates that are comparable to NN on the full dataset, we find that OMP and BP 
obtain higher rates of EFS than NN for both the half and quarter datasets. These results 
are in agreement with our experiments on synthetic data that suggest that sparse recovery 
methods outperform NN when the amount of oversampling is low. In the middle row, we 
display the final clustering performance of each method with normalized graph cuts. In all 
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Full-dat 


a 


Half-data 


Quarter-data 




OMP 


ii 


NN 


OMP 


h 


NN 


OMP 


£i 


NN 


EFS 


Mean 


55.48 


54.84 


60.72 


39.74 


38.8 


39.10 


21.5 


25.2 


15.14 


(in %) 


Median 


55.91 


55.91 


63.5 


39.68 


39.06 


39.06 


20.00 


25.00 


12.50 


Clustering error 


Mean 


1.43 


2.92 


22.03 


4.88 


13.40 


37.42 


9.89 


17.23 


40.15 


(in %) 


Median 


0.78 


0.78 


15.63 


1.56 


7.81 


42.19 


6.25 


12.50 


43.75 


NCut cost 


Mean 


8.43 


8.29 


6.48 


6.40 


5.55 


4.41 


4.49 


3.83 


3.02 


(ratio) 


Median 


7.91 


7.43 


5.80 


6.02 


5.09 


4.03 


4.10 


3.49 


2.81 



Table 1: Classification and EFS rates for illumination subspaces. Shown are the aggregate 
results obtained over (^2^) pairs of subspaces. 

cases, sparse recovery methods outperform NN. A surprising result is that OMP provides 
significant improvements over BP for both the half and quartered datasets. 

Since the theoretical guarantees we obtain for EFS with OMP are more pessimistic than 



the guarantees obtained for BP in ( Soltanolkotabi and Candes, 2012), it is not obvious why 



OMP provides better clustering performance than BP. To investigate the cost of segmenting 
subspace afhnity matrices obtained via the three methods, we compute the ratio of the 
absolute sum of the edges between points in the correct subspace (same face) to the absolute 
sum of the edges connecting points from different subspaces (different faces). We display this 
ratio along the bottom row of Table [T] (labeled Ncut cost). In all of our experiments, OMP 
has the highest ratio of in-cluster to out-of-cluster energies; this might be one reason why 
graph cuts performs better on subspace affinity matrices obtained via OMP in comparison 
to NN and BP. 



6. Discussion 

In this section, we provide insight into the implications of our results for different applica- 
tions of sparse recovery and compressive sensing. Following this, we end with some open 
questions and lines for future research. 



6.1 "Data Driven" Sparse Approximation 



The standard paradigm in signal processing and approximation theory is to compute a 
compact representation of a signal in a fixed and pre-specified basis or dictionary. In most 
cases, the dictionaries used to form these representations are designed according to some 
mathematical desiderata. A more recent approach has been to learn a dictionary from a 
collection of data, such that the data admit a sparse representation with respect to the 



learned dictionary (Olshausen and Field 1997: Aharon et al. , 2006). 



The applicability and utility of endogenous sparse recovery in subspace learning draws 
into question whether we can use endogenous sparse recovery for other tasks, including 
approximation and compression. The question that naturally arises is, "do we design a dic- 
tionary, learn a dictionary, or use the data as a dictionary?" Understanding the advantages 
and tradeoffs between each of these approaches is an interesting and open question. 
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6.2 Learning Block-Sparse Signal Models 



Block-sparse signals and other structured sparse signals have received a great deal of atten- 
tion over the past few years, especially in the context of compressive sensing from structured 



unions of subspaces ( Lu and Do , 2008 Blumensath and Davies , 2009 ) and in model-based 



CS ( jBaraniuk et al. 2010). In all of these settings, the fact that a class or collection of 



signals admit structured support patterns is leveraged in order to obtain improved recovery 
of sparse signals in noise and in the presence of undersampling. 

To exploit such structure in sparse signals — especially in situations where the structure 
of signals or blocks of active atoms may be changing across different instances in time, space, 
etc. — the underlying subspaces must be learned directly from the data. The methods that 
we have described for learning union of subspaces from ensembles of data can be utilized in 
the context of learning block sparse and other structured sparse signal models. 



6.3 Beyond Coherence 



While the maximum and cumulative coherence (Tropp, 2004) provide measures of the 



uniqueness of sub-dictionaries that are necessary to guarantee uniqueness and exact re- 
covery, our current study suggests that examining the principal angles formed from pairs 
of sub-dictionaries could provide an even richer description of the geometric properties of 
a dictionary. Thus, a study of the principal angles formed by different subsets of atoms 
from a dictionary might provide new insights into sparse recovery with coherent dictio- 
naries and compressive sensing from structured matrices. Finally, our empirical results in 



Section 5.3 suggest that there might exist an intrinsic difference between sparse recovery 
from dictionaries that exhibit sublinear versus superlinear decay in their principal angles or 
cross-spectra. It would be interesting to explore whether these two "classes" of dictionaries 
exhibit different phase transitions for sparse recovery. 



6.4 Discriminative Dictionary Learning 

While dictionary learning was originally proposed for learning dictionaries that admit sparse 



representations of a collection of signals (Olshausen and Field 1997 Aharon et al. , 2006), 



dictionary learning has recently been employed for classification. To use learned dictionaries 
for classification, a dictionary is learned for each class of training signals and then a sparse 
representation of a test signal is formed with respect to each of the learned dictionaries. 
The idea is that the test signal will admit a more compact representation with respect to 
the dictionary that was learned from the class of signals that the test signal belongs to. 
Instead of learning these dictionaries independently of one another, discriminative dic- 



tionary learning (Mairal et al. , 2008 Ramirez et al. , 2010), aims to learn a collection of 



dictionaries {$i, $2, • • • , that are incoherent from one another. This is accomplished 
by minimizing either the spectral or Frobenius norm of the matrix product ^f^j between 
pairs of dictionaries. This same approach is utilized in (Mailhe et al. 2012) to learn sensing 



matrices for CS that are incoherent with a learned dictionary. 

There are a number of interesting connections between discriminative dictionary learn- 
ing and our current study of EFS from collections of unions of subspaces. In particular, our 
study provides new insights into the role that the principal angles between two dictionaries 
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tell us about our ability to separate classes of data based upon their sparse representations. 
Our study of EFS from unions with structured cross-spectra suggests that the decay of 
the cross-spectra between different data classes provides a powerful predictor of the perfor- 
mance of sparse recovery methods from data living on a union of low-dimensional subspaces. 
This suggests that, in discriminative dictionary learning, it might be more advantageous to 
reduce the £i-norm of the entire cross-spectra rather than simply minimizing the maximum 



coherence and Probenius norm between points in different subspaces as in (Mairal et al 



2008) and (Ramirez et al. , 2010) respectively. To do this, each class of data must first be 
embedded within a subspace, a ONB is formed for each subspace, and then the ii- norm 
of the cross-spectra is minimized. An interesting and relevant question is how one might 
impose such a constraint in discriminative dictionary learning methods. 

6.5 Open Questions and Future Work 

In this paper, we have developed new geometric conditions sparse recovery with OMP from 
data living on unions of subspaces. While these conditions provide new insight into the 
interplay between the geometry of pairs of subspaces and the sampling of each subspace, 
our worst-case analysis provides a very pessimistic outlook on EFS when compared with 
our empirical results. This gap between the theory and experimental results suggests that 
it might be possible to tighten our guarantees for EFS with OMP, either by incorporating a 
probabilistic or random sampling model for the generation of points on each subspace data 
or by improving upon the current analysis. 

While the worst-case analysis we provide for OMP is more pessimistic than results 



provided for BP in Soltanolkotabi and Candes (2012), our empirical results suggest that 



OMP provides comparable performance to BP in terms of EFS. In some cases, OMP actually 
outperforms BP in terms of both EFS and their final classification performance on real data. 
Understanding what causes this gap in the performance of BP and OMP is a very interesting 
direction for future research. 

Other areas for future research include studying: the average-case behavior of sparse 
recovery methods from unions of subspaces, the performance of sparse recovery methods 
on noisy unions of subspaces]^ the gap between sparse recovery methods and NN methods 
for feature selection, and the role that the decay of the cross-spectra plays in EFS for both 
sparse recovery and NN methods. 

7. Proofs 

7.1 Proof of Theorem [T] 



Our goal is to prove that, if ( 11 ) holds, then it is sufficient to guarantee that EFS occurs for 
every point in when OMP is used for feature selection. We will prove this by induction. 

Consider the greedy selection step in OMP (see Alg. [T]) for a point which belongs to 
the subspace cluster 3^^. Recall that at the m^^ step of OMP, the point that is maximally 
correlated with the signal residual will be selected to be included in the feature set A. The 



6. In (Soltanolkotabi et al. 20131, they provide a robust variant of the SSC algorithm that can handle the 



case where points on multiple subspaces have been corrupted with Gaussian noise. 
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normalized residual at the m step is computed as 

_ {l-PA)yi , 
~\\{l-PA)yi\W ^ ' 

where Pa = Y\YI G M"^" is a projector onto the subspace spanned by the points in the 
current feature set A, where |A| = m — 1. 

To guarantee that we select a point from Sk, we require that the following greedy 
selection criterion holds: 

ma-x \(r"',v)\>uia.x \{r"',v)\. (22) 

We will prove that this selection criterion holds at each step of OMP by developing an 
upper bound on the RHS (the maximum inner product between the residual and a point 
outside of yk) and a lower bound on the LHS (the minimum inner product between the 
residual and a point in yk)- 

First, we will develop the upper bound on the RHS. In the first iteration, the residual 
is set to the signal of interest (yj). In this case, we can bound the RHS by the mutual 
coherence fic = maxj^j A*c(3^i, across all other sets 

max \{yi,yj)\ < i^c- 

Now assume that at the m^^ iteration we have selected points from the correct subspace 
cluster. This implies that our signal residual still lies within the span of yk, and thus we 
can write the residual r™" = z + e, where z is the closest point to r™ in and e is the 
remaining portion of the residual which also lies in Sk- Thus, we can bound the RHS as 
follows 

ma,x I (r*", yj)\ = max \ {z + e, yj) \ 

vj^yk vjfyk 

< max \ {z,yj)\ + \{e,yj)\ 
VjiVk 

<Hc+ max |(e,yj)| 

<IJLc + cos(6'o)||e||2||yi||2- 
Using the fact that cover (3^^) = e/2, we can bound the ^2-norm of the vector e as 

||e||2 = ll'^ — z\\2 

= sj\\r\\l + \\z\\l-2\{r,z)\ 

Plugging this quantity into our expression for the RHS, we arrive at the following upper 
bound 

max Kr"*, yj)\ <Hc + cos(6'o)\/2 - ^4 - < /Xc + cos(6'o) ^ 
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where the final simphfication comes from invoking the following Lemma. 
Lemma 1 For < x < 1, 



^2-^4" 



,2 < 



Proof of Lemma 1: We wish to develop an upper bound on the function 

f{x) = 2- V4-x2, for < X < 1. 

Thus our goal is to identify a function g{x), where f'{x) < g'{x) for < x < 1, and 
g{0) = /(O). The derivative of /(x) can be upper bounded easily as follows 

f(x) = , ^ < for < X < 1. 

Thus, g'{x) = x/\/3, and g{x) = x^/\/T2; this ensures that /'(x) < g'{x) for < x < 1, 
and g{0) = /(O). By the Fundamental Theorem of Integral Calculus, g{x) provides an 
upper bound for /(x) over the domain of interest where, < x < 1. To obtain the final 

result, take the square root of both sides. 



Second, we will develop the lower bound on the LHS of the greedy selection criterion. 
To ensure that we select a point from at the first iteration, we require that y^'s nearest 
neighbor belongs to the same subspace cluster. Let denote the nearest neighbor to yi 

Vnn = arg max|(7/i,yj)|. 

If and yi both lie in 3^^, then the first point selected via OMP will result in EFS. 

Let us assume that the points in yk admit an e-covering of the subspace cluster Sk, or 
that cover(3^fc) = e/2. In this case, we have the following bound in effect 



maxKr™,y,)|>\/l 



£2 



Putting our upper and lower bound together and rearranging terms, we arrive at our 
final condition on the mutual coherence 



e2 



;Uc < V 1 - -r - cos(6'o 



Since we have shown that this condition is sufficient to guarantee EFS at each step of Alg. 
[T] provided the residual stays in the correct subspace, Thm. [T] follows by induction. □ 

7.2 Proof of Theorem H 



we will assume that the union of subspaces is uniformly bounded in 



To prove Thm. [3 

accordance with (19). This assumption enables us to develop a tighter upper bound on 
the mutual coherence between the residual r £ Si and the points in yj. Since r £ Si, the 
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residual can be expressed as r = ^ia, where $j is an ONB that spans Si and a = ^Jr. 
Similarly, we can write all of the points in yj as y = ^jP, where $j is an ONB that spans 
Sj and ||/3||2 = ||y||2 = 1 because $j is a unitary matrix that preserves the ^2-norm of y. 
Let /3i = ^Jvi denote the subspace coefficients obtained by expanding yi in terms of yj and 

let Bj = denote the set of all subspace coefficients for all yi £ yj. The coherence 

between the residual and a point in a different subspace can be expanded as follows: 

\{r,y)\ ma,^jf3)\ 
max — — = max — 



y&yj \\r\\2 P&Bj \\a\\2 

= max ,, ,, 

/3eBj ||a||2 



Ka,[/sy^/3)| 

= max — 

peBj \\a\\2 

= max — 

PdBj \\a\\2 

<max^^^^||Sy^/3||i, (23) 
PdBj \\a\\2 

where the last step comes from an application of Holder's inequality, i.e., < 

llw^llooll^^lll- 



Now, we tackle the final expression in (23), which we can write as 



max = max = max (24) 

where the matrix ^jV contains the principal vectors in subspace Sj. Since we have assumed 
that the union is bounded, this implies that the inner product between all principal vectors 
and the points in yj are bounded by 7. It follows that ||(*l>jl^)"^?/||oo < 7- Now, suppose 
that 7^g < 1, where q = rank(G) = ||o"ij||o. In this case, 

max ||E($jy)^y||i < 7||(Tij||i. (25) 
Note that for bounded unions of subspaces, the term on the right can be made small by 



requiring that the bounding constant 7^1. Plugging this bound into (23), we obtain the 
following expression 

\{r,y)\ ^ II II r^alloo 

max < 7 CTij 1 r—r = 7 O-jj 1 t/ 2,2 = 7 kij 1) 

yeyj \\r\\2 ^" \\a\\2 ^ 

where this last simplification comes from the fact that U is unitary and has spectral norm 
equal to one. Note that this bound on the mutual coherence is informative only when 
7||'7ij||i < cTmax- This completes the proof. □ 
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